watchdog: alert on ET_NET thread stalls beyond threshold #12524
base: master
Run a watchdog thread to find blocking events.
Pull Request Overview
This PR implements a thread watchdog system to detect blocking events in the traffic server. The watchdog monitors network threads and alerts when they remain awake (not sleeping) longer than a configurable timeout threshold, indicating potential performance issues or blocking operations.
Key changes:
- Introduces a `Watchdog::Monitor` class that runs in a separate thread to monitor event loop health
- Adds heartbeat tracking to `EThread` instances to record sleep/wake timestamps
- Configures the watchdog timeout through a new configuration parameter
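
The heartbeat idea in the list above can be sketched roughly as follows. This is only an illustration under assumptions: the `Heartbeat` struct, its field names, and the `mark_sleep`/`mark_wake` helpers are hypothetical and are not taken from the PR's `Watchdog.h`. It shows the writer side, where an event loop records when it goes to sleep and when it wakes, starting from the sentinel values mentioned later in the review (`time_point::min()` and a zero sequence counter).

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical heartbeat record; names are illustrative, not the PR's.
struct Heartbeat {
  using Clock = std::chrono::steady_clock;

  // time_point::min() is the sentinel for "event loop has not run yet".
  std::atomic<Clock::time_point> last_sleep{Clock::time_point::min()};
  std::atomic<Clock::time_point> last_wake{Clock::time_point::min()};
  std::atomic<std::uint64_t>     seq{0}; // incremented on every wake-up

  // Called by the event loop just before it blocks waiting for events.
  void
  mark_sleep(Clock::time_point now)
  {
    last_sleep.store(now, std::memory_order_release);
  }

  // Called by the event loop right after it resumes work.
  void
  mark_wake(Clock::time_point now)
  {
    last_wake.store(now, std::memory_order_release);
    seq.fetch_add(1, std::memory_order_release);
  }

  // True while the thread is in its work phase (woke more recently than it slept).
  bool
  awake() const
  {
    return last_wake.load(std::memory_order_acquire) > last_sleep.load(std::memory_order_acquire);
  }
};
```

Atomics are used here so a separate monitor thread can read the timestamps without taking a lock on the event loop's hot path; whether the PR makes the same choice is not visible in this thread.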
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/traffic_server/traffic_server.cc | Creates and manages the watchdog instance, integrating it into server startup/shutdown |
| src/records/RecordsConfig.cc | Adds configuration parameter for watchdog timeout |
| src/iocore/eventsystem/Watchdog.cc | Implements the watchdog monitoring logic |
| src/iocore/eventsystem/UnixEThread.cc | Adds heartbeat updates to the event loop |
| src/iocore/eventsystem/CMakeLists.txt | Includes the new watchdog source file in the build |
| include/iocore/eventsystem/Watchdog.h | Defines watchdog interfaces and heartbeat structure |
| include/iocore/eventsystem/EThread.h | Adds heartbeat state to the EThread class |
Co-authored-by: Copilot <[email protected]>
Because ubuntu 20
macOS 14, FreeBSD 13 don't have it
This is a PR that would be nicer if we had
src/traffic_server/traffic_server.cc
// Start the watchdog
int watchdog_timeout_ms = RecGetRecordInt("proxy.config.thread_watchdog.timeout_ms").value_or(1000);
watchdog = std::make_unique<Watchdog::Monitor>(eventProcessor.thread_group[ET_NET]._thread,
Should we make this optional? That is, if proxy.config.thread_watchdog == 0, we don't set up the watchdog.
Done.
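
The final code for this change is not shown in the thread. A minimal sketch of what "optional" could look like, assuming a timeout of 0 means "disabled", is below. `Monitor` here is a stand-in for the PR's `Watchdog::Monitor` (whose real constructor also takes the thread array and count), and `make_watchdog` is a hypothetical helper.

```cpp
#include <chrono>
#include <cstdint>
#include <memory>

// Stand-in for Watchdog::Monitor; the real class also receives the ET_NET threads.
struct Monitor {
  explicit Monitor(std::chrono::milliseconds t) : timeout(t) {}
  std::chrono::milliseconds timeout;
};

// Hypothetical helper: a timeout of 0 (or less) disables the watchdog entirely.
std::unique_ptr<Monitor>
make_watchdog(std::int64_t timeout_ms)
{
  if (timeout_ms <= 0) {
    return nullptr; // disabled: no monitor object, so no monitor thread
  }
  return std::make_unique<Monitor>(std::chrono::milliseconds{timeout_ms});
}
```

Returning a null pointer keeps the shutdown path simple: resetting an empty `std::unique_ptr` is a no-op, so callers need no extra flag.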
[approve ci autest]
Pull Request Overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
watchdog = std::make_unique<Watchdog::Monitor>(eventProcessor.thread_group[ET_NET]._thread,
                                               static_cast<size_t>(eventProcessor.thread_group[ET_NET]._count),
                                               std::chrono::milliseconds{watchdog_timeout_ms});
Copilot
AI
Nov 7, 2025
The watchdog is created immediately after eventProcessor.start() is called (line 2132), but the ET_NET threads may not have fully initialized their heartbeat state yet. Since heartbeat_state members are initialized to sentinel values (time_point::min() and seq{0}), the watchdog should either wait for threads to be ready or the initialization order should be documented to prevent potential timing issues during startup.
The sentinel values are chosen this way to tolerate delays in the startup of ET_NET threads. See this code in Watchdog.cc:
if (last_sleep == std::chrono::steady_clock::time_point::min()) {
  // initial value sentinel - event loop hasn't started
  continue;
}
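
To show how that sentinel check fits into a per-thread stall test, here is a rough sketch. It is not the code from Watchdog.cc: `HeartbeatSnapshot` and `stalled_for` are hypothetical names, and the rule "stalled means awake longer than the timeout" is taken from the PR description rather than from the implementation.

```cpp
#include <chrono>
#include <optional>

using Clock = std::chrono::steady_clock;

// Hypothetical snapshot of one thread's heartbeat, as read by the monitor.
struct HeartbeatSnapshot {
  Clock::time_point last_sleep = Clock::time_point::min();
  Clock::time_point last_wake  = Clock::time_point::min();
};

// Returns the time spent awake if it exceeds `timeout`, otherwise std::nullopt.
std::optional<std::chrono::milliseconds>
stalled_for(const HeartbeatSnapshot &hb, Clock::time_point now, std::chrono::milliseconds timeout)
{
  if (hb.last_sleep == Clock::time_point::min()) {
    // initial value sentinel - event loop hasn't started
    return std::nullopt;
  }
  if (hb.last_wake <= hb.last_sleep) {
    // Thread is currently sleeping (waiting for events); nothing to report.
    return std::nullopt;
  }
  auto awake = std::chrono::duration_cast<std::chrono::milliseconds>(now - hb.last_wake);
  if (awake > timeout) {
    return awake;
  }
  return std::nullopt;
}
```

A monitor loop would call something like this once per ET_NET thread on each tick and log a warning with the thread index and the returned duration. Checking the sentinel first also avoids subtracting `time_point::min()` from `now`, which would overflow.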
Co-authored-by: Copilot <[email protected]>
Adds a watchdog thread that warns when a net thread remains in the work phase longer than a configurable duration.
Config:
proxy.config.thread_watchdog.timeout_ms (default: 1000)
Why:
Net threads should not stall; doing so adds latency to all transactions multiplexed on that thread. Stalls may indicate a misbehaving plugin, overload, or a Traffic Server bug.
On trigger, a warning is logged with the offending thread number and elapsed time to aid targeted diagnostics.